{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Transfer Learning on TPUs\n",
    "\n",
    "In the <a href=\"3_tf_hub_transfer_learning.ipynb\">previous notebook</a>, we learned how to do transfer learning with [TensorFlow Hub](https://www.tensorflow.org/hub). In this notebook, we're going to kick up our training speed with [TPUs](https://www.tensorflow.org/guide/tpu).\n",
    "\n",
    "## Learning Objectives\n",
    "1. Know how to set up a [TPU strategy](https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/TPUStrategy?version=nightly) for training\n",
    "2. Know how to use a TensorFlow Hub Module when training on a TPU\n",
    "3. Know how to create and specify a TPU for training\n",
    "\n",
    "First things first. Configure the parameters below to match your own Google Cloud project details.\n",
    "\n",
    "Each learning objective will correspond to a __#TODO__ in the notebook, where you will complete the notebook cell's code before running the cell. Refer to the [solution notebook](../solutions/4_tpu_training.ipynb))for reference.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ[\"BUCKET\"] = \"your-bucket-here\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Packaging the Model\n",
    "In order to train on a TPU, we'll need to set up a python module for training. The skeleton for this has already been built out in `tpu_models` with the data processing functions from the pevious lab copied into <a href=\"tpu_models/trainer/util.py\">util.py</a>.\n",
    "\n",
    "Similarly, the model building and training functions are pulled into <a href=\"tpu_models/trainer/model.py\">model.py</a>. This is almost entirely the same as before, except the hub module path is now a variable to be provided by the user. We'll get into why in a bit, but first, let's take a look at the new `task.py` file.\n",
    "\n",
    "We've added five command line arguments which are standard for cloud training of a TensorFlow model: `epochs`, `steps_per_epoch`, `train_path`, `eval_path`, and `job-dir`. There are two new arguments for TPU training: `tpu_address` and `hub_path`\n",
    "\n",
    "`tpu_address` is going to be our TPU name as it appears in [Compute Engine Instances](https://console.cloud.google.com/compute/instances). We can specify this name with the [ctpu up](https://cloud.google.com/tpu/docs/ctpu-reference#up) command.\n",
    "\n",
    "`hub_path` is going to be a Google Cloud Storage path to a downloaded TensorFlow Hub module.\n",
    "\n",
    "The other big difference is some code to deploy our model on a TPU. To begin, we'll set up a [TPU Cluster Resolver](https://www.tensorflow.org/api_docs/python/tf/distribute/cluster_resolver/TPUClusterResolver), which will help tensorflow communicate with the hardware to set up workers for training ([more on TensorFlow Cluster Resolvers](https://www.tensorflow.org/api_docs/python/tf/distribute/cluster_resolver/ClusterResolver)). Once the resolver [connects to](https://www.tensorflow.org/api_docs/python/tf/config/experimental_connect_to_cluster) and [initializes](https://www.tensorflow.org/api_docs/python/tf/tpu/experimental/initialize_tpu_system) the TPU system, our Tensorflow Graphs can be initialized within a [TPU distribution strategy](https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/TPUStrategy), allowing our TensorFlow code to take full advantage of the TPU hardware capabilities."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**TODO**: Complete the code below to setup the `resolver` and define the TPU training strategy. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile tpu_models/trainer/task.py\n",
    "import argparse\n",
    "import json\n",
    "import os\n",
    "import sys\n",
    "\n",
    "import tensorflow as tf\n",
    "\n",
    "from . import model\n",
    "from . import util\n",
    "\n",
    "\n",
    "def _parse_arguments(argv):\n",
    "    \"\"\"Parses command-line arguments.\"\"\"\n",
    "    parser = argparse.ArgumentParser()\n",
    "    parser.add_argument(\n",
    "        '--epochs',\n",
    "        help='The number of epochs to train',\n",
    "        type=int, default=5)\n",
    "    parser.add_argument(\n",
    "        '--steps_per_epoch',\n",
    "        help='The number of steps per epoch to train',\n",
    "        type=int, default=500)\n",
    "    parser.add_argument(\n",
    "        '--train_path',\n",
    "        help='The path to the training data',\n",
    "        type=str, default=\"gs://cloud-ml-data/img/flower_photos/train_set.csv\")\n",
    "    parser.add_argument(\n",
    "        '--eval_path',\n",
    "        help='The path to the evaluation data',\n",
    "        type=str, default=\"gs://cloud-ml-data/img/flower_photos/eval_set.csv\")\n",
    "    parser.add_argument(\n",
    "        '--tpu_address',\n",
    "        help='The path to the evaluation data',\n",
    "        type=str, required=True)\n",
    "    parser.add_argument(\n",
    "        '--hub_path',\n",
    "        help='The path to TF Hub module to use in GCS',\n",
    "        type=str, required=True)\n",
    "    parser.add_argument(\n",
    "        '--job-dir',\n",
    "        help='Directory where to save the given model',\n",
    "        type=str, required=True)\n",
    "    return parser.parse_known_args(argv)\n",
    "\n",
    "\n",
    "def main():\n",
    "    \"\"\"Parses command line arguments and kicks off model training.\"\"\"\n",
    "    args = _parse_arguments(sys.argv[1:])[0]\n",
    "    \n",
    "    # TODO: define a TPU strategy\n",
    "    resolver = # TODO: Your code goes here\n",
    "    tf.config.experimental_connect_to_cluster(resolver)\n",
    "    tf.tpu.experimental.initialize_tpu_system(resolver)\n",
    "    strategy = # TODO: Your code goes here\n",
    "    \n",
    "    with strategy.scope():\n",
    "        train_data = util.load_dataset(args.train_path)\n",
    "        eval_data = util.load_dataset(args.eval_path, training=False)\n",
    "        image_model = model.build_model(args.job_dir, args.hub_path)\n",
    "\n",
    "    model_history = model.train_and_evaluate(\n",
    "        image_model, args.epochs, args.steps_per_epoch,\n",
    "        train_data, eval_data, args.job_dir)\n",
    "\n",
    "\n",
    "if __name__ == '__main__':\n",
    "    main()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The TPU server\n",
    "Before we can start training with this code, we need a way to pull in [MobileNet](https://tfhub.dev/google/imagenet/mobilenet_v2_035_224/feature_vector/4). When working with TPUs in the cloud, the TPU will [not have access to the VM's local file directory](https://cloud.google.com/tpu/docs/troubleshooting#cannot_use_local_filesystem) since the TPU worker acts as a server. Because of this **all data used by our model must be hosted on an outside storage system** such as Google Cloud Storage. This makes [caching](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#cache) our dataset especially critical in order to speed up training time.\n",
    "\n",
    "To access MobileNet with these restrictions, we can download a compressed [saved version](https://www.tensorflow.org/hub/tf2_saved_model) of the model by using the [wget](https://www.gnu.org/software/wget/manual/wget.html) command. Adding `?tf-hub-format=compressed` at the end of our module handle gives us a download URL."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/feature_vector/4?tf-hub-format=compressed"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This model is still compressed, so lets uncompress it with the `tar` command below and place it in our `tpu_models` directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "rm -r tpu_models/hub\n",
    "mkdir tpu_models/hub\n",
    "tar xvzf 4?tf-hub-format=compressed -C tpu_models/hub/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we need to transfer our materials to the TPU. We'll use GCS as a go-between, using [gsutil cp](https://cloud.google.com/storage/docs/gsutil/commands/cp) to copy everything."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!gcloud storage rm --recursive gs://$BUCKET/tpu_models\n",
    "!gcloud storage cp --recursive tpu_models gs://$BUCKET/tpu_models"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Spinning up a TPU\n",
    "Time to wake up a TPU! Open the [Google Cloud Shell](https://console.cloud.google.com/home/dashboard?cloudshell=true) and copy the [gcloud compute](https://cloud.google.com/sdk/gcloud/reference/compute/tpus/execution-groups/create) command below. Say 'Yes' to the prompts to spin up the TPU.\n",
    "\n",
    "`gcloud compute tpus execution-groups create \\\n",
    " --name=my-tpu \\\n",
    " --zone=us-central1-b \\\n",
    " --tf-version=2.3.2 \\\n",
    " --machine-type=n1-standard-1 \\\n",
    " --accelerator-type=v3-8`\n",
    "\n",
    "It will take about five minutes to wake up. Then, it should automatically SSH into the TPU, but alternatively [Compute Engine Interface](https://console.cloud.google.com/compute/instances) can be used to SSH in. You'll know you're running on a TPU when the command line starts with `your-username@your-tpu-name`.\n",
    "\n",
    "This is a fresh TPU and still needs our code. Run the below cell and copy the output into your TPU terminal to copy your model from your GCS bucket. Don't forget to include the `.` at the end as it tells gsutil to copy data into the currect directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!echo \"gcloud storage cp --recursive gs://$BUCKET/tpu_models .\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Time to shine, TPU! Run the below cell and copy the output into your TPU terminal. Training will be slow at first, but it will pick up speed after a few minutes once the Tensorflow graph has been built out."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**TODO**: Complete the code below by adding flags for `tpu_address` and the `hub_path`. Have another look at `task.py` to see how these flags are used. The `tpu_address` denotes the TPU you created above and `hub_path` should denote the location of the TFHub module. (Note that the training code requires a TPU_NAME environment variable, set in the first two lines below -- you may reuse it in your code.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "export TPU_NAME=my-tpu\n",
    "echo \"export TPU_NAME=\"$TPU_NAME\n",
    "echo \"python3 -m tpu_models.trainer.task \\\n",
    "    # TODO: Your code goes here \\\n",
    "    # TODO: Your code goes here \\\n",
    "    --job-dir=gs://$BUCKET/flowers_tpu_$(date -u +%y%m%d_%H%M%S)\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How did it go? In the previous lab, it took about 2-3 minutes to get through 25 images. On the TPU, it took 5-6 minutes to get through 2500. That's more than 40x faster! And now our accuracy is over 90%! Congratulations!\n",
    "\n",
    "Time to pack up shop. Run `exit` in the TPU terminal to close the SSH connection, and `gcloud compute tpus execution-groups delete my-tpu --zone=us-central1-b` in the [Cloud Shell](https://console.cloud.google.com/home/dashboard?cloudshell=true) to delete the Cloud TPU and Compute Engine instances. Alternatively, they can be deleted through the [Compute Engine Interface](https://console.cloud.google.com/compute/instances), but don't forget to separately delete the [TPU](https://console.cloud.google.com/compute/tpus) too!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright 2022 Google Inc.\n",
    "Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at\n",
    "http://www.apache.org/licenses/LICENSE-2.0\n",
    "Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
   ]
  }
 ],
 "metadata": {
  "environment": {
   "name": "tf2-gpu.2-3.m69",
   "type": "gcloud",
   "uri": "gcr.io/deeplearning-platform-release/tf2-gpu.2-3:m69"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
